seed entity
Ground-Truth Subgraphs for Better Training and Evaluation of Knowledge Graph Augmented LLMs
Cattaneo, Alberto, Luschi, Carlo, Justus, Daniel
Retrieval of information from graph-structured knowledge bases represents a promising direction for improving the factuality of LLMs. While various solutions have been proposed, a comparison of methods is difficult due to the lack of challenging QA datasets with ground-truth targets for graph retrieval. We present SynthKGQA, an LLM-powered framework for generating high-quality Knowledge Graph Question Answering datasets from any Knowledge Graph, providing the full set of ground-truth facts in the KG to reason over questions. We show how, in addition to enabling more informative benchmarking of KG retrievers, the data produced with SynthKGQA also allows us to train better models.We apply SynthKGQA to Wikidata to generate GTSQA, a new dataset designed to test zero-shot generalization abilities of KG retrievers with respect to unseen graph structures and relation types, and benchmark popular solutions for KG-augmented LLMs on it.
- Leisure & Entertainment > Games > Computer Games (0.68)
- Leisure & Entertainment > Sports > Soccer (0.68)
- Media > Film (0.68)
- Government (0.67)
BMGQ: A Bottom-up Method for Generating Complex Multi-hop Reasoning Questions from Semi-structured Data
Qiu, Bingsen, Liu, Zijian, Liu, Xiao, Wang, Bingjie, Zhang, Feier, Qin, Yixuan, Li, Chunyan, Yang, Haoshen, Gao, Zeren
Building training-ready multi-hop question answering (QA) datasets that truly stress a model's retrieval and reasoning abilities remains highly challenging recently. While there have been a few recent evaluation datasets that capture the characteristics of hard-to-search but easy-to-verify problems -- requiring the integration of ambiguous, indirect, and cross-domain cues -- these data resources remain scarce and are mostly designed for evaluation, making them unsuitable for supervised fine-tuning (SFT) or reinforcement learning (RL). Meanwhile, manually curating non-trivially retrievable questions -- where answers cannot be found through a single direct query but instead require multi-hop reasoning over oblique and loosely connected evidence -- incurs prohibitive human costs and fails to scale, creating a critical data bottleneck for training high-capability retrieval-and-reasoning agents. To address this, we present BMGQ, a bottom-up automated method for generating high-difficulty, training-ready multi-hop questions from semi-structured knowledge sources. The BMGQ system (i) grows diverse, logically labeled evidence clusters through Natural Language Inference (NLI)-based relation typing and diversity-aware expansion; (ii) applies reverse question construction to compose oblique cues so that isolated signals are underinformative but their combination uniquely identifies the target entity; and (iii) enforces quality with a two-step evaluation pipeline that combines multi-model consensus filtering with structured constraint decomposition and evidence-based matching. The result is a scalable process that yields complex, retrieval-resistant yet verifiable questions suitable for SFT/RL training as well as challenging evaluation, substantially reducing human curation effort while preserving the difficulty profile of strong evaluation benchmarks.
- North America > United States (0.14)
- Asia > Japan > Honshū > Kantō > Tokyo Metropolis Prefecture > Tokyo (0.04)
- Pacific Ocean (0.04)
- (3 more...)
- Transportation > Passenger (0.94)
- Transportation > Air (0.94)
- Consumer Products & Services > Travel (0.69)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- (2 more...)
Dialogue Benchmark Generation from Knowledge Graphs with Cost-Effective Retrieval-Augmented LLMs
Omar, Reham, Mangukiya, Omij, Mansour, Essam
Dialogue benchmarks are crucial in training and evaluating chatbots engaging in domain-specific conversations. Knowledge graphs (KGs) represent semantically rich and well-organized data spanning various domains, such as DBLP, DBpedia, and YAGO. Traditionally, dialogue benchmarks have been manually created from documents, neglecting the potential of KGs in automating this process. Some question-answering benchmarks are automatically generated using extensive preprocessing from KGs, but they do not support dialogue generation. This paper introduces Chatty-Gen, a novel multi-stage retrieval-augmented generation platform for automatically generating high-quality dialogue benchmarks tailored to a specific domain using a KG. Chatty-Gen decomposes the generation process into manageable stages and uses assertion rules for automatic validation between stages. Our approach enables control over intermediate results to prevent time-consuming restarts due to hallucinations. It also reduces reliance on costly and more powerful commercial LLMs. Chatty-Gen eliminates upfront processing of the entire KG using efficient query-based retrieval to find representative subgraphs based on the dialogue context. Our experiments with several real and large KGs demonstrate that Chatty-Gen significantly outperforms state-of-the-art systems and ensures consistent model and system performance across multiple LLMs of diverse capabilities, such as GPT-4o, Gemini 1.5, Llama 3, and Mistral.
- North America > Canada (0.04)
- North America > United States > New York > Tompkins County > Ithaca (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Asia > China > Liaoning Province > Shenyang (0.04)
- Research Report (0.64)
- Workflow (0.46)
Exploring the Implicit Semantic Ability of Multimodal Large Language Models: A Pilot Study on Entity Set Expansion
Wang, Hebin, Li, Yangning, Li, Yinghui, Zheng, Hai-Tao, Jiang, Wenhao, Kim, Hong-Gee
The rapid development of multimodal large language models (MLLMs) has brought significant improvements to a wide range of tasks in real-world applications. However, LLMs still exhibit certain limitations in extracting implicit semantic information. In this paper, we apply MLLMs to the Multi-modal Entity Set Expansion (MESE) task, which aims to expand a handful of seed entities with new entities belonging to the same semantic class, and multi-modal information is provided with each entity. We explore the capabilities of MLLMs to understand implicit semantic information at the entity-level granularity through the MESE task, introducing a listwise ranking method LUSAR that maps local scores to global rankings. Our LUSAR demonstrates significant improvements in MLLM's performance on the MESE task, marking the first use of generative MLLM for ESE tasks and extending the applicability of listwise ranking.
- Asia > China > Guangdong Province > Shenzhen (0.05)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- (3 more...)
KGPrune: a Web Application to Extract Subgraphs of Interest from Wikidata with Analogical Pruning
Monnin, Pierre, Nousradine, Cherif-Hassan, Jarnac, Lucas, Zuckerman, Laurel, Couceiro, Miguel
Knowledge graphs (KGs) have become ubiquitous publicly available knowledge sources, and are nowadays covering an ever increasing array of domains. However, not all knowledge represented is useful or pertaining when considering a new application or specific task. Also, due to their increasing size, handling large KGs in their entirety entails scalability issues. These two aspects asks for efficient methods to extract subgraphs of interest from existing KGs. To this aim, we introduce KGPrune, a Web Application that, given seed entities of interest and properties to traverse, extracts their neighboring subgraphs from Wikidata. To avoid topical drift, KGPrune relies on a frugal pruning algorithm based on analogical reasoning to only keep relevant neighbors while pruning irrelevant ones. The interest of KGPrune is illustrated by two concrete applications, namely, bootstrapping an enterprise KG and extracting knowledge related to looted artworks.
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Europe > Greece (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (11 more...)
- Information Technology > Knowledge Management > Knowledge Engineering (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Ontologies (0.47)
Seed-Guided Fine-Grained Entity Typing in Science and Engineering Domains
Zhang, Yu, Zhang, Yunyi, Shen, Yanzhen, Deng, Yu, Popa, Lucian, Shwartz, Larisa, Zhai, ChengXiang, Han, Jiawei
Accurately typing entity mentions from text segments is a fundamental task for various natural language processing applications. Many previous approaches rely on massive human-annotated data to perform entity typing. Nevertheless, collecting such data in highly specialized science and engineering domains (e.g., software engineering and security) can be time-consuming and costly, without mentioning the domain gaps between training and inference data if the model needs to be applied to confidential datasets. In this paper, we study the task of seed-guided fine-grained entity typing in science and engineering domains, which takes the name and a few seed entities for each entity type as the only supervision and aims to classify new entity mentions into both seen and unseen types (i.e., those without seed entities). To solve this problem, we propose SEType which first enriches the weak supervision by finding more entities for each seen type from an unlabeled corpus using the contextualized representations of pre-trained language models. It then matches the enriched entities to unlabeled text to get pseudo-labeled samples and trains a textual entailment model that can make inferences for both seen and unseen types. Extensive experiments on two datasets covering four domains demonstrate the effectiveness of SEType in comparison with various baselines.
- North America > United States > Illinois > Champaign County > Urbana (0.04)
- North America > United States > California > Santa Clara County > San Jose (0.04)
From Retrieval to Generation: Efficient and Effective Entity Set Expansion
Huang, Shulin, Ma, Shirong, Li, Yangning, Li, Yinghui, Jiang, Yong, Zheng, Hai-Tao, Shen, Ying
Entity Set Expansion (ESE) is a critical task aiming at expanding entities of the target semantic class described by seed entities. Most existing ESE methods are retrieval-based frameworks that need to extract contextual features of entities and calculate the similarity between seed entities and candidate entities. To achieve the two purposes, they iteratively traverse the corpus and the entity vocabulary, resulting in poor efficiency and scalability. Experimental results indicate that the time consumed by the retrieval-based ESE methods increases linearly with entity vocabulary and corpus size. In this paper, we firstly propose Generative Entity Set Expansion (GenExpan) framework, which utilizes a generative pre-trained auto-regressive language model to accomplish ESE task. Specifically, a prefix tree is employed to guarantee the validity of entity generation, and automatically generated class names are adopted to guide the model to generate target entities. Moreover, we propose Knowledge Calibration and Generative Ranking to further bridge the gap between generic knowledge of the language model and the goal of ESE task. For efficiency, expansion time consumed by GenExpan is independent of entity vocabulary and corpus size, and GenExpan achieves an average 600% speedup compared to strong baselines. For expansion effectiveness, our framework outperforms previous state-of-the-art ESE methods.
- Asia > China > Guangdong Province > Shenzhen (0.05)
- North America > United States > Nevada (0.05)
- North America > United States > Texas (0.05)
- (28 more...)
Relevant Entity Selection: Knowledge Graph Bootstrapping via Zero-Shot Analogical Pruning
Jarnac, Lucas, Couceiro, Miguel, Monnin, Pierre
Knowledge Graph Construction (KGC) can be seen as an iterative process starting from a high quality nucleus that is refined by knowledge extraction approaches in a virtuous loop. Such a nucleus can be obtained from knowledge existing in an open KG like Wikidata. However, due to the size of such generic KGs, integrating them as a whole may entail irrelevant content and scalability issues. We propose an analogy-based approach that starts from seed entities of interest in a generic KG, and keeps or prunes their neighboring entities. We evaluate our approach on Wikidata through two manually labeled datasets that contain either domain-homogeneous or -heterogeneous seed entities. We empirically show that our analogy-based approach outperforms LSTM, Random Forest, SVM, and MLP, with a drastically lower number of parameters. We also evaluate its generalization potential in a transfer learning setting. These results advocate for the further integration of analogy-based inference in tasks related to the KG lifecycle.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- Europe > Austria > Vienna (0.14)
- Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.04)
- (25 more...)
Automatic Context Pattern Generation for Entity Set Expansion
Li, Yinghui, Huang, Shulin, Zhang, Xinwei, Zhou, Qingyu, Li, Yangning, Liu, Ruiyang, Cao, Yunbo, Zheng, Hai-Tao, Shen, Ying
Entity Set Expansion (ESE) is a valuable task that aims to find entities of the target semantic class described by given seed entities. Various Natural Language Processing (NLP) and Information Retrieval (IR) downstream applications have benefited from ESE due to its ability to discover knowledge. Although existing corpus-based ESE methods have achieved great progress, they still rely on corpora with high-quality entity information annotated, because most of them need to obtain the context patterns through the position of the entity in a sentence. Therefore, the quality of the given corpora and their entity annotation has become the bottleneck that limits the performance of such methods. To overcome this dilemma and make the ESE models free from the dependence on entity annotation, our work aims to explore a new ESE paradigm, namely corpus-independent ESE. Specifically, we devise a context pattern generation module that utilizes autoregressive language models (e.g., GPT-2) to automatically generate high-quality context patterns for entities. In addition, we propose the GAPA, a novel ESE framework that leverages the aforementioned GenerAted PAtterns to expand target entities. Extensive experiments and detailed analyses on three widely used datasets demonstrate the effectiveness of our method. All the codes of our experiments are available at https://github.com/geekjuruo/GAPA.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > Georgia (0.14)
- North America > United States > Ohio (0.05)
- (42 more...)
- Education (0.94)
- Health & Medicine (0.93)
- Leisure & Entertainment (0.67)
Crawling the Internal Knowledge-Base of Language Models
Cohen, Roi, Geva, Mor, Berant, Jonathan, Globerson, Amir
Language models are trained on large volumes of text, and as a result their parameters might contain a significant body of factual knowledge. Any downstream task performed by these models implicitly builds on these facts, and thus it is highly desirable to have means for representing this body of knowledge in an interpretable way. However, there is currently no mechanism for such a representation. Here, we propose to address this goal by extracting a knowledge-graph of facts from a given language model. We describe a procedure for ``crawling'' the internal knowledge-base of a language model. Specifically, given a seed entity, we expand a knowledge-graph around it. The crawling procedure is decomposed into sub-tasks, realized through specially designed prompts that control for both precision (i.e., that no wrong facts are generated) and recall (i.e., the number of facts generated). We evaluate our approach on graphs crawled starting from dozens of seed entities, and show it yields high precision graphs (82-92%), while emitting a reasonable number of facts per entity.
- Europe > Italy (0.04)
- Europe > Austria > Vienna (0.04)
- Asia > Philippines (0.04)
- (12 more...)
- Government (0.93)
- Media (0.93)
- Leisure & Entertainment > Sports (0.93)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Expert Systems (0.71)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Semantic Networks (0.55)